Coronavirus disease 2019 (COVID-19) is an infectious disease caused by the SARS-CoV-2 virus. Most people infected with the virus have mild to moderate symptoms and recover without any additional therapy. Some, on the other hand, become critically unwell and require medical assistance.
The virus can spread from an infected person’s mouth or nose in small liquid particles when they cough, sneeze, speak, sing or breathe. These particles range from larger respiratory droplets to smaller aerosols. You can be infected by breathing in the virus if you are near someone who has COVID-19, or by touching a contaminated surface and then your eyes, nose or mouth. The virus spreads more easily indoors and in crowded settings.
This report analyzes the worldwide spread of the novel coronavirus (COVID-19) and demonstrates data processing, visualization, insights and prediction.
We are given three main datasets.
As a first step, let’s look at each of them. The first table contains the country name, latitude and longitude, followed by the number of confirmed cases as time progresses. The second and third datasets similarly hold the death and recovery counts over time.
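# Packages used throughout this report
library(dplyr)    # %>%, group_by, summarise, top_n
library(tidyr)    # pivot_longer, separate
library(ggplot2)  # bar and line plots
library(naniar)   # miss_var_summary
library(leaflet)  # interactive maps
raw.data.confirmed <- read.csv('time_series_covid19_confirmed_global.csv')
head(raw.data.confirmed, n=5L)
raw.data.deaths <- read.csv('time_series_covid19_deaths_global.csv')
head(raw.data.deaths, n=5L)
raw.data.recovered <- read.csv('time_series_covid19_recovered_global.csv')
head(raw.data.recovered, n=5L)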
Here we can observe a discrepancy in the last data frame. The first two data frames consist of 284 observations of 826 variables, while the third consists of 269 observations of 826 variables. This means that recovery information is not yet available for a few instances. We can also see that the Province.State column is empty in the rows shown, so we drop that column later in the analysis.
str(raw.data.confirmed,list.len=10)
## 'data.frame': 284 obs. of 826 variables:
## $ Province.State: chr "" "" "" "" ...
## $ Country.Region: chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ Lat : num 33.9 41.2 28 42.5 -11.2 ...
## $ Long : num 67.71 20.17 1.66 1.52 17.87 ...
## $ X1.22.20 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1.23.20 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1.24.20 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1.25.20 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1.26.20 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1.27.20 : int 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
str(raw.data.deaths,list.len=10)
## 'data.frame': 284 obs. of 826 variables:
## $ Province.State: chr "" "" "" "" ...
## $ Country.Region: chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ Lat : num 33.9 41.2 28 42.5 -11.2 ...
## $ Long : num 67.71 20.17 1.66 1.52 17.87 ...
## $ X1.22.20 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1.23.20 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1.24.20 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1.25.20 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1.26.20 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1.27.20 : int 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
str(raw.data.recovered, list.len=10)
## 'data.frame': 269 obs. of 826 variables:
## $ Province.State: chr "" "" "" "" ...
## $ Country.Region: chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ Lat : num 33.9 41.2 28 42.5 -11.2 ...
## $ Long : num 67.71 20.17 1.66 1.52 17.87 ...
## $ X1.22.20 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1.23.20 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1.24.20 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1.25.20 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1.26.20 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1.27.20 : int 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
As we can see, each date is a column, and each row holds the confirmed, death or recovery counts for one location. For proper analysis, let’s reshape the data into a longer format with one row per location and date.
raw.data.confirmed <- raw.data.confirmed %>% pivot_longer(cols = starts_with("X"), names_to = "Date", values_to = "confirmed_count")
raw.data.confirmed
raw.data.deaths <- raw.data.deaths %>% pivot_longer(cols = starts_with("X"), names_to = "Date", values_to = "death_count")
raw.data.deaths
raw.data.recovered <- raw.data.recovered %>% pivot_longer(cols = starts_with("X"), names_to = "Date", values_to = "recovered_count")
raw.data.recovered
Now each row holds a date and its corresponding count (confirmed, death or recovered) on that date. Notice that the date values still carry the leading X taken from the original column names, so let’s strip it.
raw.data.confirmed$Date <- substr(raw.data.confirmed$Date,2,20)
raw.data.confirmed
raw.data.deaths$Date <- substr(raw.data.deaths$Date,2,20)
raw.data.deaths
raw.data.recovered$Date <- substr(raw.data.recovered$Date,2,20)
raw.data.recovered
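To see where the leading X comes from and how the cleaned text turns into a real date, here is a minimal base R sketch. The two header values are made up for illustration, and the sketch assumes the CSV headers use the month/day/two-digit-year layout seen in the output above.

```r
# Headers such as "1/22/20" are not valid R names, so read.csv() passes them
# through make.names(): "/" becomes "." and a leading "X" is added.
raw_names <- c("1/22/20", "12/31/21")   # made-up header values
r_names <- make.names(raw_names)

# Drop the leading "X", then parse with a two-digit year (%y, not %Y).
date_text <- substr(r_names, 2, nchar(r_names))
parsed <- as.Date(date_text, format = "%m.%d.%y")

print(r_names)
print(parsed)
```

Parsing into the Date class early makes later sorting and date arithmetic independent of the text layout.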
Now let’s merge the three data frames for easy comparison of the data.
data = merge(x = raw.data.confirmed , y = raw.data.deaths, by = c("Province.State","Country.Region","Lat","Long","Date"))
data = merge(x = data , y = raw.data.recovered, by = c("Province.State","Country.Region","Lat","Long","Date"))
# Dates look like 1.22.20, so the year is two-digit (%y)
data <- data[order(as.Date(data$Date, format="%m.%d.%y")),]
data
Here we merged the three data frames and ordered the result by date.
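By default, merge() keeps only the rows whose keys appear in both inputs, so locations missing from the smaller recovered table drop out of the merged result. The sketch below illustrates this on made-up toy frames; the all.x = TRUE variant is shown as an alternative, not as what the analysis above uses.

```r
# Toy frames: country "C" has no recovered record (made-up values).
confirmed <- data.frame(Country.Region = c("A", "B", "C"),
                        Date = "1.22.20",
                        confirmed_count = c(5, 3, 1))
recovered <- data.frame(Country.Region = c("A", "B"),
                        Date = "1.22.20",
                        recovered_count = c(2, 1))

keys <- c("Country.Region", "Date")

# Default merge: inner join, unmatched rows are dropped.
inner <- merge(confirmed, recovered, by = keys)

# all.x = TRUE keeps every confirmed row and fills the gaps with NA.
left <- merge(confirmed, recovered, by = keys, all.x = TRUE)

print(nrow(inner))
print(left)
```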
Analysing missingness is another important aspect of data analysis. Even though we do not impute missing data in this analysis, let’s explore how much of it is present.
data[data==""]<-NA
miss_var_summary(data)
data <- separate(data, Date, into=c("month", "day", "year"), sep="\\.", remove = FALSE)
data
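For readers without the naniar package, a similar per-column missingness summary can be sketched in base R. The toy data frame and the output column names n_miss and pct_miss are assumptions chosen for this illustration.

```r
# Toy data: empty strings stand in for missing province names (made-up rows).
toy <- data.frame(Province.State = c("", "", "Hubei", ""),
                  Country.Region = c("Albania", "Algeria", "China", "Andorra"),
                  confirmed_count = c(0L, 0L, 444L, 0L),
                  stringsAsFactors = FALSE)

# Treat empty strings as missing, as done in the analysis above.
toy[toy == ""] <- NA

# Count and percentage of missing values per column, most missing first.
n_miss <- colSums(is.na(toy))
missing_summary <- data.frame(variable = names(n_miss),
                              n_miss = as.integer(n_miss),
                              pct_miss = 100 * as.numeric(n_miss) / nrow(toy))
missing_summary <- missing_summary[order(-missing_summary$n_miss), ]
print(missing_summary)
```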
From here onwards we analyze the data. The analysis is organized by perspective, meaning that we look at the data from different points of view.
First, let’s store our data in a world_data variable so that we can manipulate it without disturbing our base data.
world_data <- data
world_data_by_countries <- world_data %>% group_by(Country.Region) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count),
Lat=max(Lat),
Long=max(Long))
world_data_by_countries
Here you can see a severe discrepancy between the confirmed cases and the corresponding death and recovered figures. This happens because, after a certain point in time, death and recovery data are no longer available in the database.
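One caveat about taking a per-country maximum: when a country is reported as several province rows, the maximum over rows is the largest single province rather than the national total. The sketch below, using made-up numbers and base R only, shows one possible way to sum provinces per date first. It is an alternative under that assumption, not the method used above.

```r
# Toy cumulative counts for one country reported by province (made-up values).
toy <- data.frame(
  Country.Region = c("China", "China", "China", "China"),
  Province.State = c("Hubei", "Hubei", "Beijing", "Beijing"),
  Date = as.Date(c("2020-01-22", "2020-01-23", "2020-01-22", "2020-01-23")),
  confirmed_count = c(444, 549, 14, 22)
)

# Maximum over raw rows: the largest single province.
max_of_rows <- max(toy$confirmed_count)

# Sum provinces per country and date, then take the latest cumulative total.
per_day <- aggregate(confirmed_count ~ Country.Region + Date, data = toy, FUN = sum)
country_total <- max(per_day$confirmed_count)

print(max_of_rows)
print(country_total)
```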
Now, let’s explore some quick stats about COVID-19.
Here are the top 20 countries with the highest number of confirmed cases.
top_20_countries_c <- top_n(world_data_by_countries, 20, confirmed)
confirmed_plot <- ggplot(top_20_countries_c, aes(x=Country.Region, y=confirmed)) + geom_bar(stat="identity", width=0.7, position = "dodge", aes(fill=confirmed)) + coord_flip() + scale_fill_continuous(type = "viridis") + scale_y_log10() + labs(x="\nCountry", y="Confirmed cases\n") + theme_bw() +
theme(axis.text.x=element_text(angle=45, vjust=0.5))
confirmed_plot
top_20_countries_r <- top_n(world_data_by_countries, 20, recovered)
recovered_plot <- ggplot(top_20_countries_r, aes(x=Country.Region, y=recovered)) + geom_bar(stat="identity", width=0.7, position = "dodge", aes(fill=recovered)) + coord_flip() + scale_fill_continuous(type = "viridis") + scale_y_log10() + labs(x="\nCountry", y="Recovered cases\n") + theme_bw() +
theme(axis.text.x=element_text(angle=45, vjust=0.5))
recovered_plot
Here are the top 20 countries with the highest number of deaths.
top_20_countries_d <- top_n(world_data_by_countries, 20, death)
death_plot <- ggplot(top_20_countries_d, aes(x=Country.Region, y=death)) + geom_bar(stat="identity", width=0.7, position = "dodge", aes(fill=death)) + coord_flip() + scale_fill_continuous(type = "viridis") + scale_y_log10() + labs(x="\nCountry", y="Death count\n") + theme_bw() +
theme(axis.text.x=element_text(angle=45, vjust=0.5))
death_plot
Now, let’s see the countries/regions least affected by COVID-19.
least_20_countries_c <- top_n(world_data_by_countries, -20, confirmed)
least_c <- ggplot(least_20_countries_c, aes(x=Country.Region, y=confirmed)) + geom_bar(stat="identity", width=0.7, position = "dodge", aes(fill=confirmed)) + coord_flip()
least_c
least_20_countries_r <- top_n(world_data_by_countries, -20, recovered)
least_r <- ggplot(least_20_countries_r, aes(x=Country.Region, y=recovered)) + geom_bar(stat="identity", width=0.7, position = "dodge", aes(fill=recovered)) + coord_flip()
least_r
least_20_countries_d <- top_n(world_data_by_countries, -20, death)
least_d <- ggplot(least_20_countries_d, aes(x=Country.Region, y=death)) + geom_bar(stat="identity", width=0.7, position = "dodge", aes(fill=death)) + coord_flip()
least_d
### Plot - Top Affected (Map)
top_20_countries_d
leaflet(options=leafletOptions(dragging=FALSE, minzoom=18, maxzoom=18, nowrap=TRUE)) %>% addProviderTiles("CartoDB", group="CartoBD") %>%
addCircleMarkers(data = top_20_countries_c, lng = ~Long, lat = ~Lat, label = ~Country.Region, radius= 0.2, group="Top 20 confirmed") %>%
addCircleMarkers(data = top_20_countries_r, lng = ~Long, lat = ~Lat, label = ~Country.Region, radius= 0.2, group="Top 20 recovered") %>%
addCircleMarkers(data = top_20_countries_d, lng = ~Long, lat = ~Lat, label = ~Country.Region, radius= 0.2, group="Top 20 death") %>%
addCircleMarkers(data = least_20_countries_c, lng = ~Long, lat = ~Lat, label = ~Country.Region, radius= 0.2, group="Least 20 confirmed") %>%
addCircleMarkers(data= least_20_countries_r, lng = ~Long, lat = ~Lat, label = ~Country.Region, radius= 0.2, group="Least 20 recovered") %>%
addCircleMarkers(data= least_20_countries_d, lng = ~Long, lat = ~Lat, label = ~Country.Region, radius= 0.2, group="Least 20 deaths") %>%
addLayersControl(baseGroups = c("Top 20 confirmed","Top 20 recovered","Top 20 death","Least 20 confirmed", "Least 20 recovered", "Least 20 deaths"), options = layersControlOptions(collapsed = FALSE))
Now, let’s analyze how all of this began and how the virus progressed.
Let’s first filter the data for each of the first six months of 2020.
world_data <- data
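# Convert the date parts to numbers and rescale confirmed_count (z-score * 10).
# as.vector() is applied to scale()'s result; %>% binds tighter than *.
world_data <- transform(world_data, month = as.numeric(month),
year = as.numeric(year), day=as.numeric(day)) %>% mutate_at(c("confirmed_count"), ~ as.vector(scale(.)) * 10)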
first_months <- world_data %>% filter(year==20) %>% filter(month==1) %>% group_by(Country.Region) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count),
lat = mean(Lat),
long = mean(Long))
second_months <- world_data %>% filter(year==20) %>% filter(month==2) %>% group_by(Country.Region) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count),
lat = mean(Lat),
long = mean(Long))
third_months <- world_data %>% filter(year==20) %>% filter(month==3) %>% group_by(Country.Region) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count),
lat = mean(Lat),
long = mean(Long))
forth_months <- world_data %>% filter(year==20) %>% filter(month==4) %>% group_by(Country.Region) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count),
lat = mean(Lat),
long = mean(Long))
fifth_months <- world_data %>% filter(year==20) %>% filter(month==5) %>% group_by(Country.Region) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count),
lat = mean(Lat),
long = mean(Long))
sixth_months <- world_data %>% filter(year==20) %>% filter(month==6) %>% group_by(Country.Region) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count),
lat = mean(Lat),
long = mean(Long))
second_6_months <- world_data %>% filter(year==20) %>% filter(month > 6) %>% group_by(Country.Region) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count),
lat = mean(Lat),
long = mean(Long))
world_data
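The monthly blocks above repeat the same summary with a different month each time. A small helper can express that pattern once. The sketch below uses base R and a made-up toy data frame; the function name monthly_snapshot is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: per-country maxima of the counts and mean coordinates
# for the given year and set of months.
monthly_snapshot <- function(df, yr, months) {
  sub <- df[df$year == yr & df$month %in% months, ]
  pieces <- lapply(split(sub, sub$Country.Region, drop = TRUE), function(g) {
    data.frame(Country.Region = g$Country.Region[1],
               confirmed = max(g$confirmed_count),
               death = max(g$death_count),
               recovered = max(g$recovered_count),
               lat = mean(g$Lat),
               long = mean(g$Long))
  })
  do.call(rbind, pieces)
}

# Toy data in the same column layout as world_data (made-up values).
toy <- data.frame(
  Country.Region = c("A", "A", "B", "B", "A"),
  year = c(20, 20, 20, 20, 20),
  month = c(1, 1, 1, 2, 2),
  confirmed_count = c(1, 4, 2, 7, 9),
  death_count = c(0, 1, 0, 2, 3),
  recovered_count = c(0, 0, 1, 1, 2),
  Lat = c(10, 10, 20, 20, 10),
  Long = c(30, 30, 40, 40, 30)
)

# One snapshot per month, stored in a named list.
snapshots <- lapply(1:2, function(m) monthly_snapshot(toy, 20, m))
names(snapshots) <- c("First month", "Second month")
print(snapshots)
```

Passing a range such as 7:12 as months would give the half-year summary in a single call.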
Now let’s plot these monthly snapshots of 2020 on an interactive map.
library(leaflet)
pal = colorNumeric(
palette = "viridis",
domain = world_data$confirmed_count
)
leaflet() %>%
addProviderTiles("CartoDB", group="CartoBD",options=providerTileOptions(nowrap=TRUE)) %>%
addCircleMarkers(data = first_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(first_months$confirmed), radius= ~confirmed*4, group="First month") %>%
addCircleMarkers(data = second_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(second_months$confirmed), radius= ~confirmed*4, group="Second month") %>%
addCircleMarkers(data = third_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(third_months$confirmed), radius= ~confirmed*4, group="Third month") %>%
addCircleMarkers(data = forth_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(forth_months$confirmed), radius= ~confirmed*4, group="Fourth month") %>%
addCircleMarkers(data= fifth_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(fifth_months$confirmed), radius= ~confirmed*4, group="Fifth month") %>%
addCircleMarkers(data= sixth_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(sixth_months$confirmed), radius= ~confirmed*4, group="Sixth month") %>%
addCircleMarkers(data= second_6_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(second_6_months$confirmed), radius= ~confirmed*1, group="last 6 months") %>%
addLayersControl(baseGroups = c("First month","Second month","Third month","Fourth month", "Fifth month", "Sixth month","last 6 months"), options = layersControlOptions(collapsed = FALSE))
# world_data <- data
# geo_code_merger <- select(geo_code, country, code) %>% group_by(country) %>% summarise(code=max(code))
# world_data_v2 <- merge(x = world_data , y = geo_code_merger, by.x = c("Country.Region"), by.y = ("country"))
# world_data_v2 <- world_data_v2[order(as.Date(world_data_v2$Date, format="%m.%d.%Y")),]
# df <- read.csv("graph.csv")
#p <- plot_geo(geo_code, locationmode = 'world') %>%
#add_trace( z = geo_code$new_cases_per_million, locations = geo_code$code, frame=geo_code$start_of_week,
#color = geo_code$new_cases_per_million) %>% colorbar(title = "Timeline")
# p
#export as html file
# htmlwidgets::saveWidget(p, file = "map.html")
world_data <- data
first_months <- world_data %>% filter(year==21) %>% filter(month==1) %>% group_by(Country.Region) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count),
lat = mean(Lat),
long = mean(Long))
second_months <- world_data %>% filter(year==21) %>% filter(month==2) %>% group_by(Country.Region) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count),
lat = mean(Lat),
long = mean(Long))
third_months <- world_data %>% filter(year==21) %>% filter(month==3) %>% group_by(Country.Region) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count),
lat = mean(Lat),
long = mean(Long))
forth_months <- world_data %>% filter(year==21) %>% filter(month==4) %>% group_by(Country.Region) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count),
lat = mean(Lat),
long = mean(Long))
fifth_months <- world_data %>% filter(year==21) %>% filter(month==5) %>% group_by(Country.Region) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count),
lat = mean(Lat),
long = mean(Long))
sixth_months <- world_data %>% filter(year==21) %>% filter(month==6) %>% group_by(Country.Region) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count),
lat = mean(Lat),
long = mean(Long))
# month is still a character column here, so compare it as an integer
second_6_months <- world_data %>% filter(year==21) %>% filter(as.integer(month) > 6) %>% group_by(Country.Region) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count),
lat = mean(Lat),
long = mean(Long))
pal = colorNumeric(
palette = "viridis",
domain = world_data$confirmed_count
)
leaflet() %>%
addProviderTiles("CartoDB", group="CartoBD",options=providerTileOptions(nowrap=TRUE)) %>%
addCircleMarkers(data = first_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(first_months$confirmed), radius= ~confirmed*0.000002, group="First month") %>%
addCircleMarkers(data = second_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(second_months$confirmed), radius= ~confirmed*0.000002, group="Second month") %>%
addCircleMarkers(data = third_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(third_months$confirmed), radius= ~third_months$confirmed*0.000002, group="Third month") %>%
addCircleMarkers(data = forth_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(forth_months$confirmed), radius= ~confirmed*0.000002, group="Fourth month") %>%
addCircleMarkers(data= fifth_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(fifth_months$confirmed), radius= ~confirmed*0.000002, group="Fifth month") %>%
addCircleMarkers(data= sixth_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(sixth_months$confirmed), radius= ~confirmed*0.000002, group="Sixth month") %>%
addCircleMarkers(data= second_6_months, lng = ~long, lat = ~lat, label = ~Country.Region, color=~pal(second_6_months$confirmed), radius= ~confirmed*0.000002, group="last 6 months") %>%
addLayersControl(baseGroups = c("First month","Second month","Third month","Fourth month", "Fifth month", "Sixth month","last 6 months"), options = layersControlOptions(collapsed = FALSE))
16.21 % of world`s corona virus are from US
8.62 % of world`s corona virus are from India
6.08 % of world`s corona virus are from Brazil
5.49 % of world`s corona virus are from France
4.84 % of world`s corona virus are from Germany
4.39 % of world`s corona virus are from United Kingdom
3.58 % of world`s corona virus are from Russia
3.37 % of world`s corona virus are from Korea, South
3.21 % of world`s corona virus are from Italy
3.01 % of world`s corona virus are from Turkey
world_data <- data %>% filter(year!=22)
world_data_by_countries <- world_data %>% group_by(Country.Region, month) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count), .groups = 'drop') %>% arrange(month)
world_data_by_countries <- world_data_by_countries %>% group_by(month) %>%
summarise(confirmed = sum(confirmed),
death = sum(death),
recovered = sum(recovered)) %>% arrange(month) %>% arrange(as.integer(month))
world_data_by_countries['confirmed_rev_cumsum'] <- c(world_data_by_countries$confirmed[1],diff(world_data_by_countries$confirmed))
world_data_by_countries['death_rev_cumsum'] <- c(world_data_by_countries$death[1],diff(world_data_by_countries$death))
world_data_by_countries['recovered_rev_cumsum'] <- c(world_data_by_countries$recovered[1],diff(world_data_by_countries$recovered))
template1 <- '
<div class="container-fluid bg-warning" style="padding:10px 20px;color: white;background-image: linear-gradient(to left bottom, #a6e90d, #58d056, #00b374, #00937d, #2e7171);">
<h4>Novel COVID 19 Stats Monthly status 2020 & 2021</h4>
<hr style="border-top: 1px solid white;">
<b>According to data of 2020 and 2021, each months observed,</b>
'
template2 <- '
<p>In the month of %s we have observed `%0.2f` %% total confirmed cases and a death rate of `%0.2f`</p>
'
mnths <- c("Jan", "February","March","April","May","June", "July", "Auguest", "September", "October","November","December")
cat(template1)
According to data of 2020 and 2021, each months observed,
for (i in seq(nrow(world_data_by_countries))) {
current <- world_data_by_countries[i, ]
cat(sprintf(template2, mnths[as.integer(current$month)], (current$confirmed_rev_cumsum/sum(world_data_by_countries$confirmed_rev_cumsum))*100,(current$death_rev_cumsum/sum(world_data_by_countries$death_rev_cumsum))*100))
}
In the month of Jan we have observed 35.84 % total confirmed cases and a death rate of 42.13
In the month of February we have observed 3.90 % total confirmed cases and a death rate of 5.72
In the month of March we have observed 5.12 % total confirmed cases and a death rate of 5.52
In the month of April we have observed 7.80 % total confirmed cases and a death rate of 6.98
In the month of May we have observed 6.80 % total confirmed cases and a death rate of 7.00
In the month of June we have observed 3.91 % total confirmed cases and a death rate of 5.23
In the month of July we have observed 5.46 % total confirmed cases and a death rate of 4.98
In the month of Auguest we have observed 6.89 % total confirmed cases and a death rate of 5.53
In the month of September we have observed 5.56 % total confirmed cases and a death rate of 4.85
In the month of October we have observed 4.53 % total confirmed cases and a death rate of 3.98
In the month of November we have observed 5.44 % total confirmed cases and a death rate of 3.98
In the month of December we have observed 8.74 % total confirmed cases and a death rate of 4.10
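The monthly shares above rely on turning cumulative totals back into per-month increments with c(x[1], diff(x)). The sketch below checks that idea on made-up numbers for two years. Note that the first element is not an increment: it keeps the first month's cumulative values as they are, which may explain an unusually large January share when years are combined.

```r
# Made-up cumulative totals for January to March of two consecutive years.
cum_2020 <- c(10, 25, 45)
cum_2021 <- c(60, 90, 100)

# Summing by month, as group_by(month) does when both years are present.
summed <- cum_2020 + cum_2021

# "Reverse cumsum": first value kept as is, then month-to-month differences.
rev_cumsum <- c(summed[1], diff(summed))

# From the second month on, this equals the sum of each year's own increment.
inc_2020 <- diff(cum_2020)
inc_2021 <- diff(cum_2021)
print(rev_cumsum[-1])
print(inc_2020 + inc_2021)

# Share of each month in the total, in percent.
share <- 100 * rev_cumsum / sum(rev_cumsum)
print(round(share, 2))
```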
world_data <- data %>% filter(year==20)
world_data_by_countries <- world_data %>% group_by(Country.Region, month) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count), .groups = 'drop') %>% arrange(month)
world_data_by_countries <- world_data_by_countries %>% group_by(month) %>%
summarise(confirmed = sum(confirmed),
death = sum(death),
recovered = sum(recovered)) %>% arrange(month) %>% arrange(as.integer(month))
world_data_by_countries['confirmed_rev_cumsum'] <- c(world_data_by_countries$confirmed[1],diff(world_data_by_countries$confirmed))
world_data_by_countries['death_rev_cumsum'] <- c(world_data_by_countries$death[1],diff(world_data_by_countries$death))
world_data_by_countries['recovered_rev_cumsum'] <- c(world_data_by_countries$recovered[1],diff(world_data_by_countries$recovered))
template1 <- '
<div class="container-fluid bg-warning" style="padding:10px 20px;color: white; background-image: linear-gradient(to left bottom, #051937, #3c405e, #6e6c87, #a29cb3, #d7cfe1);">
<h4>Novel COVID 19 Stats Monthly status 2020</h4>
<hr style="border-top: 1px solid white;">
<b>According to data of 2020, each months observed,</b>
'
template2 <- '
<p>In the month of %s we have observed `%0.2f` %% total confirmed cases and a death rate of `%0.2f`</p>
'
mnths <- c("Jan", "February","March","April","May","June", "July", "Auguest", "September", "October","November","December")
cat(template1)
According to data of 2020, each months observed,
world_data_by_countries_bkp <- world_data_by_countries
for (i in seq(nrow(world_data_by_countries))) {
current <- world_data_by_countries[i, ]
cat(sprintf(template2, mnths[as.integer(current$month)], (current$confirmed_rev_cumsum/sum(world_data_by_countries$confirmed_rev_cumsum))*100, (current$death_rev_cumsum/sum(world_data_by_countries$death_rev_cumsum))*100))
}
In the month of Jan we have observed 0.01 % total confirmed cases and a death rate of 0.01
In the month of February we have observed 0.08 % total confirmed cases and a death rate of 0.14
In the month of March we have observed 0.92 % total confirmed cases and a death rate of 2.22
In the month of April we have observed 2.84 % total confirmed cases and a death rate of 10.32
In the month of May we have observed 3.45 % total confirmed cases and a death rate of 7.96
In the month of June we have observed 5.15 % total confirmed cases and a death rate of 7.67
In the month of July we have observed 8.55 % total confirmed cases and a death rate of 9.44
In the month of Auguest we have observed 9.54 % total confirmed cases and a death rate of 9.83
In the month of September we have observed 10.17 % total confirmed cases and a death rate of 9.00
In the month of October we have observed 14.47 % total confirmed cases and a death rate of 9.83
In the month of November we have observed 20.57 % total confirmed cases and a death rate of 14.71
In the month of December we have observed 24.26 % total confirmed cases and a death rate of 18.85
world_data <- data %>% filter(year==21)
world_data_by_countries <- world_data %>% group_by(Country.Region, month) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count), .groups = 'drop') %>% arrange(month)
world_data_by_countries <- world_data_by_countries %>% group_by(month) %>%
summarise(confirmed = sum(confirmed),
death = sum(death),
recovered = sum(recovered)) %>% arrange(month) %>% arrange(as.integer(month))
world_data_by_countries['confirmed'] <- world_data_by_countries$confirmed - sum(world_data_by_countries_bkp$confirmed_rev_cumsum)
world_data_by_countries['confirmed_rev_cumsum'] <- c(world_data_by_countries$confirmed[1],diff(world_data_by_countries$confirmed))
world_data_by_countries['death_rev_cumsum'] <- c(world_data_by_countries$death[1],diff(world_data_by_countries$death))
world_data_by_countries['recovered_rev_cumsum'] <- c(world_data_by_countries$recovered[1],diff(world_data_by_countries$recovered))
template1 <- '
<div class="container-fluid bg-warning" style="padding:10px 20px;color: white; background-image: linear-gradient(to right top, #051937, #004d7a, #008793, #00bf72, #a8eb12);">
<h4>Novel COVID 19 Stats Monthly status 2021</h4>
<hr style="border-top: 1px solid white;">
<b>According to data of 2021, each months observed,</b>
'
template2 <- '
<p>In the month of %s we have observed `%0.2f` %% total confirmed cases and a death rate of `%0.2f`</p>
'
mnths <- c("Jan", "February","March","April","May","June", "July", "Auguest", "September", "October","November","December")
cat(template1)
According to data of 2021, each months observed,
for (i in seq(nrow(world_data_by_countries))) {
current <- world_data_by_countries[i, ]
cat(sprintf(template2, mnths[as.integer(current$month)], (current$confirmed_rev_cumsum/sum(world_data_by_countries$confirmed_rev_cumsum))*100,(current$death_rev_cumsum/sum(world_data_by_countries$death_rev_cumsum))*100))
}
In the month of Jan we have observed 9.54 % total confirmed cases and a death rate of 42.13
In the month of February we have observed 5.50 % total confirmed cases and a death rate of 5.72
In the month of March we have observed 7.22 % total confirmed cases and a death rate of 5.52
In the month of April we have observed 11.00 % total confirmed cases and a death rate of 6.98
In the month of May we have observed 9.59 % total confirmed cases and a death rate of 7.00
In the month of June we have observed 5.52 % total confirmed cases and a death rate of 5.23
In the month of July we have observed 7.70 % total confirmed cases and a death rate of 4.98
In the month of Auguest we have observed 9.71 % total confirmed cases and a death rate of 5.53
In the month of September we have observed 7.83 % total confirmed cases and a death rate of 4.85
In the month of October we have observed 6.39 % total confirmed cases and a death rate of 3.98
In the month of November we have observed 7.67 % total confirmed cases and a death rate of 3.98
In the month of December we have observed 12.32 % total confirmed cases and a death rate of 4.10
world_data <- data
world_data_by_countries <- world_data %>% group_by(Country.Region, month) %>%
summarise(confirmed = max(confirmed_count),
death = max(death_count),
recovered = max(recovered_count), .groups = 'drop') %>% arrange(month)
c1 <- world_data_by_countries %>% filter(month==c(10))
c2 <- world_data_by_countries %>% filter(month==c(12))
c2$confirmed <- c2$confirmed-c1$confirmed
c2$death <- c2$death-c1$death
c2 <- top_n(c2, 10, confirmed) %>% arrange(-confirmed)
template1 <- '
<div class="container-fluid bg-warning" style="padding:10px 20px;color: white; background-image: linear-gradient(to right top, #051937, #004d7a, #008793, #00bf72, #a8eb12);">
<h4>Country with highest confirmed/death rates in last Quarter(Q4)</h4>
<hr style="border-top: 1px solid white;">
<b>According to data in Quarter(Q4),</b>
'
template2 <- '
<p>In %s we have observed `%0.2f` %% total confirmed cases and a death rate of `%0.2f`</p>
'
cat(template1)
According to data in Quarter(Q4),
for (i in seq(nrow(c2))) {
current <- c2[i, ]
cat(sprintf(template2, current$Country.Region, (current$confirmed/sum(c2$confirmed))*100,(current$death/sum(c2$death))*100))
}
In US we have observed 33.68 % total confirmed cases and a death rate of 36.45
In United Kingdom we have observed 14.85 % total confirmed cases and a death rate of 3.62
In France we have observed 10.61 % total confirmed cases and a death rate of 2.68
In Germany we have observed 9.77 % total confirmed cases and a death rate of 7.34
In Russia we have observed 7.43 % total confirmed cases and a death rate of 31.03
In Turkey we have observed 5.55 % total confirmed cases and a death rate of 5.32
In Italy we have observed 5.18 % total confirmed cases and a death rate of 2.40
In Spain we have observed 4.91 % total confirmed cases and a death rate of 0.92
In Poland we have observed 4.15 % total confirmed cases and a death rate of 9.09
In Netherlands we have observed 3.86 % total confirmed cases and a death rate of 1.14
| 2020 | 2021 |
|---|---|
| x | x + y + ………+ Z |
| x+y | x + y + ………+ Z + V |
| …. | …… |
| …. | …… |
Now,
x + (x + y + … + Z) (1)
(x + y) + (x + y + … + Z + V) (2)
(2) - (1) = y + V, which is the sum of the individual values of each month.
Now, let’s look at the top countries that have been affected, considering their population. We used an additional dataset for this investigation, which consists of the name of each nation and its population. Click here for dataset. Here we joined the COVID-19 data with the population data.
Ratio of total affected people vs population
top_20_countries <- top_20_countries_c
# Load the population dataset (its pop2022 column is used below)
library(readr)
csvData <- read_csv("csvData.csv")
# csvData$pop2022 <- csvData$pop2022 *1000
csvData[c(csvData$country=="United States"),]['country'] ='US'
csvData[c(csvData$country=="South Korea"),]['country'] = "Korea, South"
top_20_countries <- merge(x =top_20_countries , y = csvData, by.x = c("Country.Region"), by.y=c("country"))
top_20_countries <- top_20_countries[order(-top_20_countries$confirmed),]
top_20_countries['confirmed_to_pop_ratio'] <- top_20_countries$confirmed/top_20_countries$pop2022
top_20_countries['death_to_pop_ratio'] <- top_20_countries$death/top_20_countries$pop2022
top_20_countries <- top_20_countries[order(-top_20_countries$confirmed_to_pop_ratio),]
top_20_countries[c('Country.Region', 'confirmed', 'death', 'confirmed_to_pop_ratio')]
Ratio of total deaths vs population
top_20_countries <- top_20_countries[order(-top_20_countries$death_to_pop_ratio),]
top_20_countries[c('Country.Region', 'confirmed', 'death', 'death_to_pop_ratio')]
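Raw ratios like these are small fractions that can be hard to compare at a glance. A common alternative is to express them per 100,000 inhabitants. The sketch below uses made-up countries and figures; the column names confirmed_per_100k and death_per_100k are assumptions for this illustration.

```r
# Made-up totals and populations for three hypothetical countries.
toy <- data.frame(Country.Region = c("A", "B", "C"),
                  confirmed = c(500000, 80000, 1200000),
                  death = c(10000, 900, 15000),
                  pop2022 = c(10e6, 1e6, 60e6),
                  stringsAsFactors = FALSE)

# Cases and deaths per 100,000 inhabitants.
toy$confirmed_per_100k <- 1e5 * toy$confirmed / toy$pop2022
toy$death_per_100k <- 1e5 * toy$death / toy$pop2022

# Rank by population-adjusted confirmed cases, highest first.
ranked <- toy[order(-toy$confirmed_per_100k), ]
print(ranked[c("Country.Region", "confirmed_per_100k", "death_per_100k")])
```

In this toy example the smallest country ranks first once population is taken into account, even though it has the fewest confirmed cases in absolute terms.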
Now let’s select a few top countries and analyse the data in more depth. First, let’s consider the United States, the country with a very high number of confirmed cases.
library(ggplot2)
data_by_country <- data
data_by_country$Date <- data_by_country$Date %>% as.Date("%m.%d.%y")
country <- data_by_country %>% group_by(Country.Region) %>% mutate(cumconfirmed=cumsum(confirmed_count), days = Date - first(Date) + 1)
US <- country %>% filter(Country.Region=="US")
country
ggplot(US, aes(x=days, y=confirmed_count)) + geom_line(color="red") +
theme_classic() +
labs(title = "Covid-19 United States Confirmed Cases", x= "Days", y= "Cumulative confirmed cases") +
theme(plot.title = element_text(hjust = 0.5))
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
ggplot(US, aes(x=days, y=death_count)) + geom_line(color="red") +
theme_classic() +
labs(title = "Covid-19 United States Death Cases", x= "Days", y= "Cumulative deaths") +
theme(plot.title = element_text(hjust = 0.5))
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
ggplot(US, aes(x=days, y=recovered_count)) + geom_line(color="red") +
theme_classic() +
labs(title = "Covid-19 United States Recovered Cases", x= "Days", y= "Cumulative recovered cases") +
theme(plot.title = element_text(hjust = 0.5))
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
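The difftime message printed after each plot appears because subtracting two Date values yields a difftime object rather than a plain number. A minimal sketch, with made-up dates, of converting the day counter to numeric before plotting:

```r
# Subtracting dates gives a difftime, which ggplot2 has no default scale for.
dates <- as.Date(c("2020-01-22", "2020-01-23", "2020-01-25"))
days <- dates - dates[1] + 1
print(class(days))

# Converting to plain numbers gives an ordinary continuous axis.
days_num <- as.numeric(days, units = "days")
print(days_num)
```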
drop <- c("Province.State")
country = country[,!(names(country) %in% drop)]
# Some inconsistency in the UK data, hence ignoring it
country <- country %>% filter(Country.Region != "United Kingdom")
# %in% tests membership; == would recycle the vector and silently drop rows
country <- country %>% filter(Country.Region %in% top_20_countries$Country.Region)
world_perspective <- ggplot(country, aes(x=days, y=confirmed_count, group=Country.Region, color=Country.Region)) + geom_line() + labs(title = "Covid-19 Confirmed Cases in world perspective", x= "Days", y= "Daily confirmed cases") +
theme(plot.title = element_text(hjust = 0.5))
world_perspective
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
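When filtering rows to a set of countries, it helps to remember that comparing a column to a vector with == pairs the elements position by position and recycles the shorter vector, so it is not a membership test. The toy vectors below (hypothetical values) show how the result differs from %in%.

```r
# Hypothetical country column and the set of countries we want to keep.
countries <- c("US", "India", "Brazil", "US", "India", "Brazil")
keep <- c("India", "US")

# == recycles keep as India, US, India, US, ... and compares pairwise.
by_equals <- countries == keep

# %in% asks, for every element, whether it occurs anywhere in keep.
by_in <- countries %in% keep

print(sum(by_equals))  # rows kept with ==
print(sum(by_in))      # rows kept with %in%
```

With ==, rows are kept only when they happen to line up with the recycled vector, so most of the matching rows are lost without any error.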
Here we have data on how COVID-19 affected the individual states of the United States. For many countries, data at this level is not available.
us_data_confirmed <- read.csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv')
us_data_confirmed <- us_data_confirmed %>% pivot_longer(cols = starts_with("X"), names_to = "Date", values_to = "confirmed_count")
us_data_confirmed$Date <- substr(us_data_confirmed$Date,2,20)
us_data_confirmed <- us_data_confirmed %>% group_by(Province_State) %>% summarise(confirmed=max(confirmed_count), Lat=median(Lat), Long_=median(Long_))
lng<-mean(us_data_confirmed$Long_)
lat<-mean(us_data_confirmed$Lat)
pal = colorNumeric(
palette = "viridis",
domain = us_data_confirmed$`confirmed`
)
leaflet(us_data_confirmed) %>% addTiles() %>%
addCircleMarkers(lng = ~Long_, lat = ~Lat,
label = ~Province_State,
color=~pal(us_data_confirmed$confirmed),
radius= ~confirmed*0.000015)%>%
addLegend( "bottomright", pal = pal, values = ~confirmed,
title = "Total Affected",
labFormat = labelFormat(prefix = " "),
opacity = 0.75)%>%
setView(lat= 35, lng=-100,zoom=4)
data_by_country <- data
data_by_country$Date <- data_by_country$Date %>% as.Date("%m.%d.%y")
country <- data_by_country %>% group_by(Country.Region) %>% mutate(cumconfirmed=cumsum(confirmed_count), days = Date - first(Date) + 1)
Australia <- country %>% filter(Country.Region=="Australia") %>% group_by(Date) %>% mutate(confirmed_count=sum(confirmed_count),
death_count=sum(death_count),
recovered_count=sum(recovered_count))
ggplot(Australia, aes(x=days, y=confirmed_count)) + geom_line(color="red") +
theme_classic() +
labs(title = "Covid-19 Australia Confirmed Cases", x= "Days", y= "Daily confirmed cases") +
theme(plot.title = element_text(hjust = 0.5))
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
ggplot(Australia, aes(x=days, y=death_count)) + geom_line(color="red") +
theme_classic() +
labs(title = "Covid-19 Australia Death Cases", x= "Days", y= "Daily deaths") +
theme(plot.title = element_text(hjust = 0.5))
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
ggplot(Australia, aes(x=days, y=recovered_count)) + geom_line(color="red") +
theme_classic() +
labs(title = "Covid-19 Australia Recovered Cases", x= "Days", y= "Daily recovered cases") +
theme(plot.title = element_text(hjust = 0.5))
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
# Some inconsistency in the UK data, hence ignoring it
country <- country %>% filter(Country.Region != "United Kingdom")
# %in% tests membership; == would recycle the vector and silently drop rows
country <- country %>% filter(Country.Region %in% top_20_countries$Country.Region)
world_perspective <- ggplot(country, aes(x=days, y=confirmed_count, group=Country.Region, color=Country.Region)) + geom_line() + theme_classic() +
labs(title = "Covid-19 Confirmed Cases in world perspective", x= "Days", y= "Daily confirmed cases") +
theme(plot.title = element_text(hjust = 0.5))
world_perspective
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
Here is a plot of how COVID-19 affected the different states and territories of Australia.
data_by_country <- data
data_by_country$Date <- data_by_country$Date %>% as.Date("%m.%d.%y")
country <- data_by_country %>% group_by(Country.Region) %>% mutate(cumconfirmed=cumsum(confirmed_count), days = Date - first(Date) + 1)
us_data_confirmed <- country %>% filter(Country.Region=="Australia")
us_data_confirmed <- us_data_confirmed %>% group_by(Province.State) %>% summarise(confirmed=max(confirmed_count), Lat=median(Lat), Long_=median(Long))
lng<-mean(us_data_confirmed$Long_)
lat<-mean(us_data_confirmed$Lat)
pal = colorNumeric(
palette = "viridis",
domain = us_data_confirmed$`confirmed`
)
leaflet(us_data_confirmed) %>% addTiles() %>%
addCircleMarkers(lng = ~Long_, lat = ~Lat,
label = ~Province.State,
color=~pal(us_data_confirmed$confirmed),
radius= ~confirmed*0.000025)%>%
addLegend( "bottomright", pal = pal, values = ~confirmed,
title = "Total Affected",
labFormat = labelFormat(prefix = " "),
opacity = 0.75)%>%
setView(lat= -30, lng=140,zoom=4)
data_by_country <- data
data_by_country$Date <- data_by_country$Date %>% as.Date("%m.%d.%y")
country <- data_by_country %>% group_by(Country.Region) %>% mutate(cumconfirmed=cumsum(confirmed_count), days = Date - first(Date) + 1)
# %in% tests membership; == would recycle the vector and silently drop rows
country <- country %>% filter(Country.Region %in% top_20_countries$Country.Region)
world_perspective <- ggplot(country, aes(x=days, y=confirmed_count, group=Country.Region, color=Country.Region)) + geom_line() +
theme_classic() +
labs(title = "Covid-19 Confirmed Cases in world perspective", x= "Days", y= "Daily confirmed cases") +
theme(plot.title = element_text(hjust = 0.5)) + facet_wrap(~Country.Region)
world_perspective
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
world_perspective <- ggplot(country, aes(x=days, y=death_count, group=Country.Region, color=Country.Region)) + geom_line() +
theme_classic() +
labs(title = "Covid-19 Death Cases in world perspective", x= "Days", y= "Daily deaths") +
theme(plot.title = element_text(hjust = 0.5)) + facet_wrap(~Country.Region)
world_perspective
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
world_perspective <- ggplot(country, aes(x=days, y=recovered_count, group=Country.Region, color=Country.Region)) + geom_line() +
theme_classic() +
labs(title = "Covid-19 Recovered Cases in world perspective", x= "Days", y= "Daily recovered cases") +
theme(plot.title = element_text(hjust = 0.5)) + facet_wrap(~Country.Region)
world_perspective
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
In this section we analyze the situation in India. Since the data required for this analysis is not present in the CSSEGISandData/COVID-19 repo, we use another dataset for this purpose.
str(covid)
## 'data.frame': 18110 obs. of 9 variables:
## $ Sno : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Date : chr "2020-01-30" "2020-01-31" "2020-02-01" "2020-02-02" ...
## $ Time : chr "6:00 PM" "6:00 PM" "6:00 PM" "6:00 PM" ...
## $ State.UnionTerritory : chr "Kerala" "Kerala" "Kerala" "Kerala" ...
## $ ConfirmedIndianNational : chr "1" "1" "2" "3" ...
## $ ConfirmedForeignNational: chr "0" "0" "0" "0" ...
## $ Cured : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Deaths : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Confirmed : int 1 1 2 3 3 3 3 3 3 3 ...
str(testing)
## 'data.frame': 16336 obs. of 5 variables:
## $ Date : chr "2020-04-17" "2020-04-24" "2020-04-27" "2020-05-01" ...
## $ State : chr "Andaman and Nicobar Islands" "Andaman and Nicobar Islands" "Andaman and Nicobar Islands" "Andaman and Nicobar Islands" ...
## $ TotalSamples: num 1403 2679 2848 3754 6677 ...
## $ Negative : int 1210 NA NA NA NA NA NA NA NA NA ...
## $ Positive : num 12 27 33 33 33 33 33 33 33 33 ...
str(vaccine)
## 'data.frame': 7644 obs. of 24 variables:
## $ Updated.On : chr "16/01/2021" "17/01/2021" "18/01/2021" "19/01/2021" ...
## $ State : chr "India" "India" "India" "India" ...
## $ Total.Doses.Administered : num 48276 58604 99449 195525 251280 ...
## $ Sessions : num 3455 8532 13611 17855 25472 ...
## $ Sites : num 2957 4954 6583 7951 10504 ...
## $ First.Dose.Administered : num 48276 58604 99449 195525 251280 ...
## $ Second.Dose.Administered : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Male..Doses.Administered. : num NA NA NA NA NA NA NA NA NA NA ...
## $ Female..Doses.Administered. : num NA NA NA NA NA NA NA NA NA NA ...
## $ Transgender..Doses.Administered. : num NA NA NA NA NA NA NA NA NA NA ...
## $ Covaxin..Doses.Administered. : num 579 635 1299 3017 3946 ...
## $ CoviShield..Doses.Administered. : num 47697 57969 98150 192508 247334 ...
## $ Sputnik.V..Doses.Administered. : num NA NA NA NA NA NA NA NA NA NA ...
## $ AEFI : num NA NA NA NA NA NA NA NA NA NA ...
## $ X18.44.Years..Doses.Administered. : num NA NA NA NA NA NA NA NA NA NA ...
## $ X45.60.Years..Doses.Administered. : num NA NA NA NA NA NA NA NA NA NA ...
## $ X60..Years..Doses.Administered. : num NA NA NA NA NA NA NA NA NA NA ...
## $ X18.44.Years.Individuals.Vaccinated.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X45.60.Years.Individuals.Vaccinated.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X60..Years.Individuals.Vaccinated. : num NA NA NA NA NA NA NA NA NA NA ...
## $ Male.Individuals.Vaccinated. : num 23757 27348 41361 81901 98111 ...
## $ Female.Individuals.Vaccinated. : num 24517 31252 58083 113613 153145 ...
## $ Transgender.Individuals.Vaccinated. : num 2 4 5 11 24 38 80 103 128 201 ...
## $ Total.Individuals.Vaccinated : num 48276 58604 99449 195525 251280 ...
In this section we analyze the Indian data from several angles.
full_covid_data <- inner_join(covid,testing, by=c("Date"="Date","State.UnionTerritory"="State"))
full_covid_data[is.na(full_covid_data)] <- 0
top_affected <- full_covid_data %>% group_by(State.UnionTerritory) %>% summarise(Cured=max(Cured), Deaths=max(Deaths), Confirmed=max(Confirmed)) %>%
select(State.UnionTerritory,Cured,Deaths,Confirmed) %>%
arrange(desc(Confirmed)) %>% top_n(10)
## Selecting by Confirmed
ta <- as.vector(top_affected[['State.UnionTerritory']])
full_covid_data$Date <- as.Date(full_covid_data$Date)
full_covid_data %>%
filter(State.UnionTerritory %in% ta) %>%
ggplot(aes(x=Date,y=Confirmed)) + geom_line(aes(color=State.UnionTerritory),size=1.2)+
scale_x_date(limit=c(as.Date("2020-04-01"),as.Date("2021-08-11"))) +
theme_classic() +
scale_y_continuous(labels=scales :: number_format(accuracy=1))+
labs(title='Time Series for Confirmed Cases',subtitle = 'Top affected states')+
xlab(label='Time Period') +
ylab(label='Confirmed Cases') +
scale_fill_viridis_d()
full_covid_data$Active = (full_covid_data$Confirmed-(full_covid_data$Deaths + full_covid_data$Cured))
full_covid_data %>%
filter(State.UnionTerritory %in% ta) %>%
ggplot(aes(x=Date,y=Active)) + geom_line(aes(color=State.UnionTerritory),size=1.2)+
scale_x_date(limit=c(as.Date("2020-04-01"),as.Date("2021-05-07"))) +
scale_y_continuous(labels=scales :: number_format(accuracy=1))+
labs(title='Time Series for Active Cases',subtitle = 'Top 10 worst affected states')+
theme_classic() +
xlab(label='Time Period') +
ylab(label='Active Cases') +
scale_fill_viridis_d()
full_covid_data %>%
filter(Date==max(Date)) %>%
ggplot(aes(x=Confirmed,y=State.UnionTerritory))+geom_col(fill='red',alpha=0.8)+
scale_x_continuous(labels=scales :: number_format(accuracy=1))+
theme_minimal() +
labs(title="Total Confirmed cases grouped by states")
full_covid_data %>%
filter(Date==max(Date)) %>%
ggplot(aes(x=Active,y=State.UnionTerritory))+geom_col(fill='green',alpha=0.8)+
scale_x_continuous(labels=scales :: number_format(accuracy=1))+
theme_light() +
labs(title="Total Active cases grouped by states")
full_covid_data %>%
filter(Date==max(Date)) %>%
ggplot(aes(x=Deaths,y=State.UnionTerritory))+geom_col()+
scale_fill_viridis_d() +
scale_x_continuous(labels=scales :: number_format(accuracy=1))+
theme_light() +
labs(title="Total Deaths grouped by states")
Here is the detailed plot of the growth of COVID-19 in India.
india<-full_covid_data %>%
group_by(Date) %>%
summarise(Cured_tot=sum(Cured),
Deaths_tot=sum(Deaths),
Confirmed_tot=sum(Confirmed),
Active_tot=sum(Active))
plot_india <- india %>%
ggplot(aes(x=Date,y=Confirmed_tot)) + geom_line(color='blue',size=1) +
labs(title="Times series for Confirmed Cases")+
theme_linedraw() +
xlab(label ="Time Period") +
ylab(label="Confirmed Cases") +
scale_y_continuous(labels = scales :: number_format(accuracy=1))
plot_india
library(lubridate)
# transmute() in dplyr creates new, typically computed, variables. Unlike mutate(), it drops all other columns by default.
tbl_covid_19_india <- covid
colnames(tbl_covid_19_india) <- sub("/", "", colnames(tbl_covid_19_india), fixed = TRUE)
tbl_covid_19_india <- tbl_covid_19_india %>% mutate(new_date = ymd(Date)) %>%
transmute(
Sno = Sno,
Date = new_date,
StateUnionTerritory = State.UnionTerritory,
ConfirmedIndianNational = ConfirmedIndianNational,
ConfirmedForeignNational = ConfirmedForeignNational,
Cured = Cured,
Deaths = Deaths,
Confirmed = Confirmed
)
# tbl_covid_19_india
tbl_deaths_percentage_1 <- inner_join(tbl_covid_19_india,
tbl_covid_19_india %>% group_by(StateUnionTerritory) %>%
summarise(max_date = max(Date)) %>% ungroup() %>%
transmute(StateUnionTerritory = StateUnionTerritory,
Date = max_date), by = c("StateUnionTerritory", "Date"))
# tbl_deaths_percentage_1
tbl_deaths_percentage <- mutate(tbl_deaths_percentage_1,
new_StateUnionTerritory = str_replace(StateUnionTerritory, "#", ""),
new_StateUnionTerritory1 = str_replace(new_StateUnionTerritory, "Andaman and Nicobar Islands", "Andaman & Nicobar")) %>%
transmute(state = new_StateUnionTerritory1,
Date = Date,
Cured = Cured,
Deaths = Deaths,
Confirmed = Confirmed)
# tbl_deaths_percentage
# COVID 19 India - Case Fatality Rate - % of Deaths/Confirmed Cases
p_death <- tbl_deaths_percentage %>% group_by(state) %>%
summarise(sum_cured = sum(Cured),
sum_deaths = sum(Deaths),
sum_confirmed = sum(Confirmed),
deaths_perc = round(sum(Deaths)/sum(Confirmed)*100, digits = 2)) %>%
filter(deaths_perc != 0) %>%
ggplot(mapping = aes(x = reorder(state, deaths_perc), y = deaths_perc)) +
geom_bar(mapping = aes(fill = state), stat = "identity", show.legend = FALSE) +
coord_flip() +
xlab("States/Union Territories") +
ylab("% of Deaths/Confirmed") +
ggtitle("Case Fatality Rate - % of Deaths/Confirmed Cases") +
scale_fill_viridis_d() + theme_minimal()
p_death
# COVID 19 India - % of Cured/Confirmed Cases
p_cured <- tbl_deaths_percentage %>% group_by(state) %>%
summarise(sum_cured = sum(Cured),
sum_deaths = sum(Deaths),
sum_confirmed = sum(Confirmed),
cured_perc = round(sum(Cured)/sum(Confirmed)*100, digits = 2)) %>%
filter(cured_perc != 0) %>% mutate(rown = row_number(desc(cured_perc))) %>% filter(rown <= 25) %>%
ggplot(mapping = aes(x = reorder(state, cured_perc), y = cured_perc)) +
geom_bar(mapping = aes(fill = state), stat = "identity", show.legend = FALSE) +
coord_flip() +
xlab("States/Union Territories") +
ylab("% of Cured/Confirmed") +
ggtitle("Case Cured Rate - % of Cured/Confirmed Cases") +
scale_fill_viridis_d() + theme_minimal()
p_cured
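Because the counts in this dataset are cumulative, the two percentages above are computed from the most recent row of each state only. A minimal base R sketch of the same arithmetic, on a made-up four-row table (state names and numbers are hypothetical), looks like this:

```r
# Hypothetical cumulative counts for two states on two dates.
toy <- data.frame(
  state = c("X", "X", "Y", "Y"),
  Date = as.Date(c("2021-08-10", "2021-08-11", "2021-08-10", "2021-08-11")),
  Confirmed = c(1000, 1100, 200, 250),
  Deaths = c(10, 11, 4, 5),
  Cured = c(900, 990, 150, 200)
)

# Keep only the latest row per state, since the counts are cumulative.
latest <- do.call(rbind, lapply(split(toy, toy$state), function(d) {
  d[d$Date == max(d$Date), ]
}))

# Case fatality rate and cured rate, as percentages of confirmed cases.
latest$deaths_perc <- round(latest$Deaths / latest$Confirmed * 100, digits = 2)
latest$cured_perc <- round(latest$Cured / latest$Confirmed * 100, digits = 2)
print(latest[c("state", "deaths_perc", "cured_perc")])
```

Summing over all dates instead of taking the latest row would add cumulative totals together and distort both rates.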
tbl_state_testing_details <- testing
tbl_state_testing_details <- transmute(tbl_state_testing_details,
Date = Date,
State = State,
TotalSamples = replace_na(TotalSamples, 0),
Negative = replace_na(Negative, 0),
Positive = replace_na(Positive, 0)
)
p_testing_details <- tbl_state_testing_details %>% filter(TotalSamples != 0) %>% group_by(State) %>%
filter(Date == max(Date)) %>%
ungroup() %>% transmute(
Date = Date,
State = State,
Negative = ifelse(Negative == 0, TotalSamples - Positive, Negative),
Positive = ifelse(Positive == 0, TotalSamples - Negative, Positive),
TotalSamples = Negative + Positive
) %>%
pivot_longer(c(Negative, Positive), names_to = "type", values_to = "Samples") %>%
ggplot(mapping = aes(x = reorder(State, desc(TotalSamples)), y = Samples)) +
geom_col(mapping = aes(fill = type), position = position_stack(reverse = TRUE), show.legend = TRUE) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
coord_flip() +
ylab("Total Samples Tested") +
xlab("State") +
ggtitle("Testing Volumes by States") +
scale_fill_manual(values = c("orange", "red"))
p_testing_details
p_ratio_positive_tests <- tbl_state_testing_details %>% filter(TotalSamples != 0, Positive != 0) %>% group_by(State) %>%
filter(Date == max(Date)) %>%
ungroup() %>%
mutate(Positive_test_ratio = round(Positive/TotalSamples, digits = 2),
rown = row_number(desc(Positive_test_ratio))) %>%
filter(rown <= 20) %>%
ggplot(mapping = aes(x = reorder(State, desc(-Positive_test_ratio)), y = Positive_test_ratio)) +
geom_bar(mapping = aes(fill = State), stat = "identity", show.legend = FALSE) +
coord_flip() +
ylab("Ratio of Positive Samples Tested") +
xlab("State") +
ggtitle("Test positivity by State") +
scale_fill_viridis_d()
p_ratio_positive_tests
vaccine_na <- subset(vaccine, !is.na(Total.Doses.Administered))
# The dates carry a four-digit year (e.g. "16/01/2021"), so use %Y rather than %y
vaccine_na$Updated.On <- as.Date(vaccine_na$Updated.On,format="%d/%m/%Y")
vaccine_na <- vaccine_na %>% filter(State != 'India')
top_vaccine <- vaccine_na %>%
filter(Updated.On ==max(Updated.On )) %>%
select(State,Total.Doses.Administered) %>%
arrange(desc(Total.Doses.Administered)) %>%
top_n(5)
## Selecting by Total.Doses.Administered
tv <- top_vaccine[['State']]
tv[6] <- 'Kerala'
vaccine_na <- rename(vaccine_na,Date = Updated.On)
vaccine_na %>%
filter(State %in% tv) %>%
ggplot(aes(x=Date,y=Total.Doses.Administered)) + geom_line(aes(color=State))+
labs(title="Time Series for Doses Administered")
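The Updated.On column stores dates such as 16/01/2021, with a four-digit year, so the choice between %y and %Y in the format string matters. The sketch below parses two sample strings (copied from the str(vaccine) output above) both ways; as.Date ignores trailing characters, so %y reads only the first two digits of the year.

```r
# Two sample values in the day/month/four-digit-year layout of Updated.On.
raw <- c("16/01/2021", "17/01/2021")

# %y expects a two-digit year: it consumes "20" and ignores the trailing "21".
two_digit <- as.Date(raw, format = "%d/%m/%y")

# %Y expects the full four-digit year.
four_digit <- as.Date(raw, format = "%d/%m/%Y")

print(two_digit)
print(four_digit)
```

A quick range(Date) check after parsing is a cheap way to confirm that the time axis covers the period the data is supposed to describe.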
Now let's look at the age-wise and gender-wise distribution of vaccination across the country.
First, let's analyze the overall distribution of vaccine doses.
vaccination_data <- vaccine_na %>%
group_by(Date) %>%
summarise(Date,tot = sum(Total.Doses.Administered),
tot_cv=sum(Covaxin..Doses.Administered.),
tot_cs=sum(CoviShield..Doses.Administered.),
tot_m=sum(Male..Doses.Administered.),
tot_f=sum(Female..Doses.Administered.),
tot_t=sum(Transgender..Doses.Administered.),
tot_i=sum(Total.Individuals.Vaccinated)) %>%
summarise(Total_dose=mean(tot),
Total_covaxi = mean(tot_cv),
Total_covis =mean(tot_cs),
Total_Male = mean(tot_m),
Total_Female = mean(tot_f),
Total_Transgender = mean(tot_t),
Total_vaccinated = mean(tot_i))
## `summarise()` has grouped output by 'Date'. You can override using the `.groups`
## argument.
vaccination_data %>%
ggplot(aes(x=Date)) + geom_area(aes(y=Total_dose,color='green'),fill='green',alpha=.3) +
geom_area(aes(y=Total_Male,color='blue'),fill='blue',alpha=.3) +
geom_area(aes(y=Total_Female,color='red'),fill='red',alpha=.3) +
geom_area(aes(y=Total_Transgender,color='yellow'),fill='black',alpha=1) +
labs(title="Time series for Vaccinated") +
xlab(label ="Time Period") +
ylab(label="Total Vaccinated") +
scale_y_continuous(labels = scales :: number_format(accuracy=1))+
theme(legend.position="right")+
scale_color_identity(name = "Legend",
breaks = c("green", "blue", "red","yellow"),
labels = c("Total Vaccinated", "Men", "Women","Transgender"),
guide = "legend")
vaccine_bar <- vaccine_na %>%
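# Select 16 March by month and day, so the filter does not depend on how the year was parsed
filter(format(Date, "%m-%d") == "03-16") %>%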
select(State, X18.44.Years.Individuals.Vaccinated., X45.60.Years.Individuals.Vaccinated., X60..Years.Individuals.Vaccinated.)
vaccine_bar <- vaccine_bar %>% pivot_longer(cols = starts_with("X"), names_to = "Age Group", values_to = "value")
vaccine_bar <- subset(vaccine_bar, !is.na(value))
vaccine_bar %>%
filter(State %in% tv) %>%
ggplot(aes(x=State,y=value,fill=`Age Group`))+geom_bar(stat='identity',position = 'fill') +
scale_fill_discrete(name='Age Group',
breaks=c('X18.44.Years.Individuals.Vaccinated.', 'X45.60.Years.Individuals.Vaccinated.','X60..Years.Individuals.Vaccinated.'),
labels=c('18 to 44','45 to 60','>60')) +
ylab('Percentage') +
theme_classic()+
labs(title='Age group distribution for')
Let's look at some statistical modelling techniques as the last stage of the research, to see if we can forecast or estimate certain variables. Modelling can be applied here in several ways. Time series analysis/forecasting is one prominent method; however, it is outside the scope of this study, so we do not pursue it. Another approach is to predict a value from its relationship with other factors. So we try to predict the death count from other variables such as the confirmed count and the country. For this, we use a linear model, evaluated with the helper functions of the statisticalModeling library, and CART (Classification and Regression Trees) from the rpart library. The "Prediction Performance" value printed after each model is the mean squared error on the test set, so lower is better.
library(statisticalModeling)
#model <- lm(net~age, data = Runners)
data_model <- world_data_by_countries
smp_size <- floor(0.75 * nrow(data_model))
set.seed(123)
train_ind <- sample(seq_len(nrow(data_model)), size = smp_size)
train <- data_model[train_ind, ]
test <- data_model[-train_ind, ]
model1 <- lm(death~confirmed, data = train)
result = evaluate_model(model1, data = test)
fmodel(model1)
cat("Assessing Prediction Performance:", mean((result$death - result$model_output) ^ 2, na.rm = TRUE))
## Assessing Prediction Performance: 1263026095
library(statisticalModeling)
#model <- lm(net~age, data = Runners)
data_model <- world_data_by_countries
smp_size <- floor(0.75 * nrow(data_model))
set.seed(123)
train_ind <- sample(seq_len(nrow(data_model)), size = smp_size)
train <- data_model[train_ind, ]
test <- data_model[-train_ind, ]
model1 <- lm(death~confirmed+Country.Region, data = train)
result = evaluate_model(model1, data = test)
fmodel(model1)
cat("Assessing Prediction Performance:", mean((result$death - result$model_output) ^ 2, na.rm = TRUE))
## Assessing Prediction Performance: 106289582
library(rpart)
rpart_1<-rpart(death~confirmed,data=train,cp=0.02)
result = evaluate_model(rpart_1, data = test)
fmodel(rpart_1)
cat("Assessing Prediction Performance:", mean((result$death - result$model_output) ^ 2, na.rm = TRUE))
## Assessing Prediction Performance: 1422784267
library(rpart)
rpart_2<-rpart(death~confirmed,data=train,cp=0.000002)
result = evaluate_model(rpart_2, data = test)
fmodel(rpart_2)
cat("Assessing Prediction Performance:", mean((result$death - result$model_output) ^ 2, na.rm = TRUE))
## Assessing Prediction Performance: 1303754682
library(rpart)
rpart_3<-rpart(death~confirmed+Country.Region,data=train,cp=0.0002)
result = evaluate_model(rpart_3, data = test)
fmodel(rpart_3)
cat("Assessing Prediction Performance:", mean((result$death - result$model_output) ^ 2, na.rm = TRUE))
## Assessing Prediction Performance: 191932715
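The figures above are mean squared errors, which are in squared units and therefore hard to read. A minimal sketch of the same train/test evaluation using only base R is shown below; it runs on simulated data (the 2% death rate and the noise level are assumptions made purely for illustration) and reports the root mean squared error next to a naive baseline that always predicts the training mean.

```r
set.seed(123)

# Simulated stand-in for world_data_by_countries (assumed relationship).
n <- 200
toy <- data.frame(confirmed = runif(n, 0, 1e5))
toy$death <- 0.02 * toy$confirmed + rnorm(n, sd = 100)

# 75/25 train/test split, as in the analysis above.
train_ind <- sample(seq_len(n), size = floor(0.75 * n))
train <- toy[train_ind, , drop = FALSE]
test <- toy[-train_ind, , drop = FALSE]

# Fit on the training rows, predict on the held-out rows.
fit <- lm(death ~ confirmed, data = train)
pred <- predict(fit, newdata = test)

# RMSE is in the same unit as the response (number of deaths).
rmse <- sqrt(mean((test$death - pred)^2))

# Baseline: always predict the mean death count of the training set.
baseline_rmse <- sqrt(mean((test$death - mean(train$death))^2))

cat("Model RMSE:", rmse, "\n")
cat("Baseline RMSE:", baseline_rmse, "\n")
```

Comparing each model against such a baseline gives a sense of how much of the error the predictors actually remove.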
That brings us to the conclusion of our investigation. Before we wrap up, let's take a quick look at everything we've done so far, what impact it has had, and how it helps build stronger resistance to COVID-19. We started by looking at worldwide COVID-19 data. After noticing that it wasn't in the appropriate format for our research, we preprocessed and combined the data into the format we needed. We then experimented with various data exploration methods to gain insight into how COVID-19 began, which regions it affected, what impact it had over time, how that impact varied across seasons and months, and so on. We recorded our observations in the key takeaways section at the appropriate places. We then used a different dataset to look at the effects of COVID-19 on the population and at the situation in our own country. As stated in our objectives, we did our best to examine and investigate the effects of the pandemic.
This analysis, like all other analyses, has some limitations. Missing data is one of the most significant. We found that a substantial part of the COVID-19 recovery data is missing, and we also noticed some inconsistencies within the recorded data. In addition, some countries have been accused of under-reporting, which could lead to incorrect interpretations of the data.